The “wineQualityReds.csv” dataset contains 1599 observations of 13 variables. The ‘X’ variable is undocumented; it runs from 1 to 1599 and appears to be a row index. The remaining 12 variables describe various properties of the wines. The quality variable is an integer rating based on scores from at least three wine experts; in this dataset it ranges from 3 (the worst rating) to 8 (the best rating).
## X fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 1 7.4 0.70 0.00 1.9 0.076
## 2 2 7.8 0.88 0.00 2.6 0.098
## 3 3 7.8 0.76 0.04 2.3 0.092
## 4 4 11.2 0.28 0.56 1.9 0.075
## 5 5 7.4 0.70 0.00 1.9 0.076
## 6 6 7.4 0.66 0.00 1.8 0.075
## free.sulfur.dioxide total.sulfur.dioxide density pH sulphates alcohol
## 1 11 34 0.9978 3.51 0.56 9.4
## 2 25 67 0.9968 3.20 0.68 9.8
## 3 15 54 0.9970 3.26 0.65 9.8
## 4 17 60 0.9980 3.16 0.58 9.8
## 5 11 34 0.9978 3.51 0.56 9.4
## 6 13 40 0.9978 3.51 0.56 9.4
## quality
## 1 5
## 2 5
## 3 5
## 4 6
## 5 5
## 6 5
## [1] "X" "fixed.acidity" "volatile.acidity"
## [4] "citric.acid" "residual.sugar" "chlorides"
## [7] "free.sulfur.dioxide" "total.sulfur.dioxide" "density"
## [10] "pH" "sulphates" "alcohol"
## [13] "quality"
First, each variable was examined with univariate analysis to get a feel for the overall distribution of the data. This helps in making statistical assumptions in later steps. Univariate analysis is also a useful way to check data quality and to spot outliers.
## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
The structure output above describes the dataset compactly, with one line per variable showing its type and first few values.
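The inspection steps above can be sketched as follows. This is a minimal illustration, not the original code: `wine_sample` is a hypothetical stand-in built from the first ten values of three columns shown in the structure output, so the block runs without the CSV file.

```r
# Hypothetical stand-in for the wine data: the first ten values of three
# columns, copied from the structure output above. With the real file one
# would start from read.csv("wineQualityReds.csv") instead.
wine_sample <- data.frame(
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  pH      = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35),
  quality = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

head(wine_sample)     # first six rows
names(wine_sample)    # column names
str(wine_sample)      # one line per variable: type and first values
summary(wine_sample)  # min, quartiles, median, mean, max per column

# Dimensions of the sample (the full dataset has 1599 rows and 13 columns)
n_obs  <- nrow(wine_sample)
n_vars <- ncol(wine_sample)
```

On the full dataset the same four calls would produce the kind of listings shown above.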
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol quality
## Min. : 8.40 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.20 Median :6.000
## Mean :10.42 Mean :5.636
## 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :14.90 Max. :8.000
The summary output above gives the minimum, quartiles, median, mean and maximum of each variable.
On plotting the variables, we observed the following:
Fixed acidity, volatile acidity, citric acid, free sulfur dioxide, total sulfur dioxide and sulphates were plotted on a log10 scale.
We created a new variable, total.acidity, as the sum of fixed acidity, volatile acidity and citric acid, to see whether it shows any interesting pattern or association.
Residual sugar and chlorides showed a large number of outliers, so we removed the top 5% of their values; the resulting plots looked fairly normal.
What is the structure of your dataset?
The red wine dataset consists of 1,599 observations of 12 variables (excluding the index column X) that describe different chemical properties of wine. Eleven variables are numeric, whereas one variable, quality, is an integer. Many of us enjoy wine without knowing the chemistry behind its quality and taste, so it is interesting to see how the quality of wine relates to the chemical compounds present in it.
What is/are the main feature(s) of interest in your dataset?
The main feature of interest is quality. I want to understand how the quality of wine correlates with the chemical parameters.
What other features in the dataset do you think will help support your investigation into your feature(s) of interest?
Several of the variables in this dataset are interrelated (e.g. alcohol, density, fixed acidity, volatile acidity, citric acid and pH), and a change in one chemical parameter can affect the others. I expect these chemical constituents, mainly alcohol and acidity, to have the dominant effect on wine quality.
Did you create any new variables from existing variables in the dataset?
I created a new variable, total.acidity, which is the sum of the three variables fixed.acidity, volatile.acidity and citric.acid.
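As a small check of this derived variable, the sketch below recomputes it for the first six rows printed at the top of the report. The data frame name `wine_head` is an assumption made for illustration; the values are copied from the output above.

```r
# First six rows of the three acidity columns, copied from the head() output.
# `wine_head` is a hypothetical name used only for this illustration.
wine_head <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4),
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66),
  citric.acid      = c(0.00, 0.00, 0.04, 0.56, 0.00, 0.00)
)

# New variable: row-wise sum of the three acidity measures
wine_head$total.acidity <- with(
  wine_head,
  fixed.acidity + volatile.acidity + citric.acid
)

wine_head$total.acidity
# 8.10 8.68 8.60 12.04 8.10 8.06
```

These values agree with the total.acidity column shown in the later listing of the data.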
Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?
A log transformation replaces each value of a variable with its logarithm (base 10 in this analysis). When the data do not fit the model we are looking for, a log transformation can pull a strongly right-skewed distribution closer to a normal shape, so that patterns are easier to see. The transformation does not by itself “normalize” the data, but it reduces skew when the data are highly skewed to the right. Several variables (total sulfur dioxide, free sulfur dioxide, citric acid, volatile acidity and fixed acidity) had positively skewed distributions, and most of them looked fairly normal after the log10 transformation. Citric acid was the exception: it was positively skewed on the original scale but became negatively skewed after the log10 transformation. I am not sure what this implies; one likely contributor is that citric acid includes values of 0 (see the summary above), for which log10 is undefined. A few variables (residual sugar and chlorides) had a large number of outliers. After the top 5 percent of their values was removed, their distributions looked fairly normal.
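The two operations described above (a log10 transformation and trimming the top 5% of values) can be illustrated on simulated data. This is a sketch under stated assumptions: `x` is hypothetical log-normal data rather than a wine variable, and `skewness()` is a small helper defined here, not a function from the original analysis.

```r
set.seed(42)
# Hypothetical right-skewed data standing in for a variable such as chlorides
x <- rlnorm(1000, meanlog = 0, sdlog = 1)

# Moment-based sample skewness (helper defined for this illustration)
skewness <- function(v) {
  m <- mean(v)
  mean((v - m)^3) / (mean((v - m)^2))^1.5
}

skew_raw <- skewness(x)         # strongly positive
skew_log <- skewness(log10(x))  # much closer to zero

# Trim the top 5%: keep values at or below the 95th percentile
cutoff       <- quantile(x, 0.95)
x_trimmed    <- x[x <= cutoff]
skew_trimmed <- skewness(x_trimmed)

# A value of exactly 0 cannot be log-transformed: log10(0) is -Inf
zero_logged <- log10(0)
```

Comparing `skew_raw` with `skew_log` and `skew_trimmed` shows how each operation reduces right skew, and `zero_logged` shows why variables containing zeros need care on a log scale.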
A bivariate correlation matrix was created to explore positive and negative associations among the variables.
## X fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 1 7.4 0.70 0.00 1.9 0.08
## 2 2 7.8 0.88 0.00 2.6 0.10
## 3 3 7.8 0.76 0.04 2.3 0.09
## 4 4 11.2 0.28 0.56 1.9 0.08
## 5 5 7.4 0.70 0.00 1.9 0.08
## 6 6 7.4 0.66 0.00 1.8 0.08
## free.sulfur.dioxide total.sulfur.dioxide density pH sulphates alcohol
## 1 11 34 1 3.51 0.56 9.4
## 2 25 67 1 3.20 0.68 9.8
## 3 15 54 1 3.26 0.65 9.8
## 4 17 60 1 3.16 0.58 9.8
## 5 11 34 1 3.51 0.56 9.4
## 6 13 40 1 3.51 0.56 9.4
## quality total.acidity
## 1 5 8.10
## 2 5 8.68
## 3 5 8.60
## 4 6 12.04
## 5 5 8.10
## 6 5 8.06
The focus of this data exploration was to find chemical parameters that affect wine quality. However, the bivariate correlation matrix plot did not show any strong relationship between quality and the other variables. The pairs with positive correlations (r > 0.45) were quality and alcohol, and fixed acidity and density. The pairs with negative correlations were alcohol and density, and fixed acidity and pH. These correlations were studied further in the bivariate analysis. The large number of outliers in some variables could be one reason why we did not see a strong relationship between quality and the chemical parameters. In addition, quality is an integer rating and most wines score 5 or 6, which could also explain the lack of strong correlation between quality and the other variables.
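A correlation matrix of this kind, and the step of picking out pairs above a threshold, can be sketched with base R. The data frame `wine_sample` is a hypothetical ten-row stand-in built from the first values in the structure output, so the resulting coefficients are illustrative only and will not match those of the full dataset.

```r
# Hypothetical ten-row sample (values copied from the structure output above)
wine_sample <- data.frame(
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  pH            = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35),
  alcohol       = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality       = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Pearson correlation matrix of all columns
r <- cor(wine_sample)
r_rounded <- round(r, 2)

# Pairs with |r| > 0.45, taking each pair once (upper triangle only)
idx <- which(abs(r) > 0.45 & upper.tri(r), arr.ind = TRUE)
strong_pairs <- data.frame(
  var1 = rownames(r)[idx[, "row"]],
  var2 = colnames(r)[idx[, "col"]],
  r    = r[idx]
)

r_rounded
strong_pairs
```

The 0.45 threshold mirrors the cut-off used in the discussion above; on the full data the same code would list the pairs named there.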
According to the above plots:
The boxplots show the relationships between quality ratings and the variables pH, alcohol and density. Wines with more alcohol, lower density and higher acidity tend to receive higher quality ratings.
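The comparison that a boxplot makes visually can also be summarised numerically as a median per quality rating. The sketch below uses a hypothetical ten-row sample (`wine_sample`, values copied from the structure output), so it only demonstrates the mechanics and says nothing reliable about the full dataset.

```r
# Hypothetical ten-row sample (first values from the structure output above)
wine_sample <- data.frame(
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# Median alcohol content within each quality rating
alcohol_medians <- tapply(wine_sample$alcohol, wine_sample$quality, median)
alcohol_medians
# rating 5: 9.4, rating 6: 9.8, rating 7: 9.75

# The matching plot would be drawn with:
# boxplot(alcohol ~ quality, data = wine_sample)
```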
Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?
According to the correlation matrix, graphs and boxplots, the three parameters most clearly related to quality were density, alcohol and pH/acidity. The plots revealed that wines with higher alcohol content (% by volume) were rated higher in quality. Density and quality ratings were inversely related: the lower the density of a wine, the higher its rating tended to be. Finally, the pH of the red wines varied between about 2.7 and 4.0, and highly rated wines were on the more acidic side of the graph.
Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?
Another relationship was observed among the variables describing acidity and pH. Since fixed acidity, volatile acidity, citric acid, total acidity and pH all describe the acidic property of wine, these five variables showed some association among themselves (e.g. fixed acidity and pH, volatile acidity and pH, citric acid and pH, total acidity and pH). Because their association with the quality variable was not very strong according to the matrix graph, these associations were not studied further.
What was the strongest relationship you found?
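The strongest association involving the quality variable (the variable of interest) was between quality and alcohol content (correlation coefficient r of about 0.48, which corresponds to an r-squared of about 0.23, consistent with model m1 below). The strongest association between any two variables was between total acidity and citric acid (reported value 0.69, which is likely also a correlation coefficient rather than an r-squared).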
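The distinction between r and r-squared matters here: in a linear regression with a single predictor, the model's r-squared equals the square of the Pearson correlation. The following sketch checks this on simulated data; the variables `alcohol_sim` and `quality_sim` are hypothetical and only loosely imitate the wine variables.

```r
set.seed(1)
# Hypothetical data: one predictor and a noisy linear response
alcohol_sim <- runif(200, min = 8, max = 15)
quality_sim <- 2 + 0.36 * alcohol_sim + rnorm(200, sd = 0.7)

# Pearson correlation and the r-squared of the one-predictor model
r_xy <- cor(quality_sim, alcohol_sim)
fit  <- lm(quality_sim ~ alcohol_sim)
r2   <- summary(fit)$r.squared

# For simple linear regression these agree: r^2 == R-squared
same <- isTRUE(all.equal(r_xy^2, r2))

# Applied to the report: an R-squared of 0.227 (model m1) implies |r| of
r_from_m1 <- sqrt(0.227)  # about 0.476
```

This is why a value near 0.48 for quality and alcohol is best read as r, with 0.227 as the matching r-squared.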
The alcohol and pH plots above suggest that higher alcohol content combined with low pH is associated with higher quality wines.
Wines with higher alcohol content (% by volume) and low density tend to be high quality wines.
The two plots above relate acidity to pH, which are expected to be inversely related. In the plot of pH against fixed acidity, no specific pattern by quality was observed. In the plot of volatile acidity against pH, high quality wines had a pH between 3 and 3.5 and lower volatile acidity.
##
## Calls:
## m1: lm(formula = quality ~ alcohol, data = wine)
## m2: lm(formula = quality ~ alcohol + density, data = wine)
## m3: lm(formula = quality ~ alcohol + density + volatile.acidity,
## data = wine)
## m4: lm(formula = quality ~ alcohol + density + volatile.acidity +
## fixed.acidity, data = wine)
## m5: lm(formula = quality ~ alcohol + density + volatile.acidity +
## fixed.acidity + citric.acid, data = wine)
##
## ==========================================================================================
## m1 m2 m3 m4 m5
## ------------------------------------------------------------------------------------------
## (Intercept) 1.875*** -33.152** -18.407 15.573 13.448
## (0.175) (10.878) (10.298) (15.187) (15.198)
## alcohol 0.361*** 0.391*** 0.333*** 0.311*** 0.316***
## (0.017) (0.019) (0.018) (0.020) (0.020)
## density 34.822** 21.360* -12.922 -10.845
## (10.813) (10.228) (15.214) (15.223)
## volatile.acidity -1.365*** -1.272*** -1.405***
## (0.096) (0.100) (0.116)
## fixed.acidity 0.045** 0.063***
## (0.015) (0.017)
## citric.acid -0.308*
## (0.137)
## ------------------------------------------------------------------------------------------
## R-squared 0.227 0.232 0.319 0.323 0.325
## adj. R-squared 0.226 0.231 0.318 0.321 0.323
## sigma 0.710 0.708 0.667 0.665 0.665
## F 468.267 240.693 248.893 189.939 153.348
## p 0.000 0.000 0.000 0.000 0.000
## Log-likelihood -1721.057 -1715.878 -1619.631 -1615.017 -1612.485
## Deviance 805.870 800.668 709.855 705.771 703.539
## AIC 3448.114 3439.757 3249.261 3242.034 3238.969
## BIC 3464.245 3461.265 3276.147 3274.297 3276.609
## N 1599 1599 1599 1599 1599
## ==========================================================================================
After analysing the variables with univariate, bivariate and multivariate analysis, we fitted a series of linear models. A linear model can be used to predict quality when the alcohol content and the other predictor values are known. Before using a regression model, we examined its statistical significance. The p-values of the overall models and of the predictors alcohol, fixed acidity, volatile acidity and citric acid in the full model (m5) were below 0.05; density was no longer significant once fixed acidity was included.
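The workflow described above (fit nested models, check significance, then predict) can be sketched in base R. Everything in this block is an assumption made for illustration: `toy` is simulated data whose coefficients were chosen arbitrarily, and `new_wine` is a made-up observation, so none of the numbers correspond to the table above.

```r
set.seed(123)
n <- 300
# Hypothetical data with ranges similar to the wine variables
toy <- data.frame(
  alcohol          = runif(n, 8.4, 14.9),
  volatile.acidity = runif(n, 0.12, 1.58)
)
toy$quality <- 3 + 0.3 * toy$alcohol - 1.3 * toy$volatile.acidity +
  rnorm(n, sd = 0.7)

# Nested models: add one predictor at a time
m1 <- lm(quality ~ alcohol, data = toy)
m2 <- update(m1, . ~ . + volatile.acidity)

# Compare fit: R-squared never decreases when a predictor is added
r2  <- c(m1 = summary(m1)$r.squared, m2 = summary(m2)$r.squared)
aic <- AIC(m1, m2)

# Coefficient p-values of the larger model
p_values <- summary(m2)$coefficients[, "Pr(>|t|)"]

# Predict quality for a new, made-up wine
new_wine  <- data.frame(alcohol = 11, volatile.acidity = 0.5)
predicted <- predict(m2, newdata = new_wine)
```

Building the models incrementally makes it easy to see how much each added predictor improves the fit, which is the comparison the m1 to m5 table presents.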
Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?
Fixed acidity and citric acid showed one of the strongest positive associations in the bivariate matrix plot. In the plot of citric acid against fixed acidity above, with the points colored by wine quality category, the high quality wines were concentrated towards one side of the plot.
Were there any interesting or surprising interactions between features?
To me, the surprising finding was the lack of any relationship between quality and residual sugar. Residual sugar and chlorides are considered very important parameters for wine quality, yet in this study their association with quality was very weak.
OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.
I created a linear model with alcohol, density, volatile acidity, fixed acidity and citric acid as the predictor variables and wine quality as the outcome variable. The overall r-squared value of the model was quite low (0.325). However, the predictors alcohol, fixed acidity, volatile acidity and citric acid were statistically significant: the summary showed p-values below 0.05, the pre-determined significance level.
The most important predictor variable in the model is alcohol. A limitation of this model is the lack of diversity in the quality variable, as more than 80% of the wines in the dataset were rated 5 or 6.
For the first plot, I chose the corrplot, which gives both the big picture of the associations between variables and the individual correlation values. The color scale distinguishes positive from negative correlations.
Acidity plays an important role in wine quality. Fixed acidity and citric acid showed a positive association and are preferred parameters in good quality wines. In contrast, volatile acidity is considered a flaw in wine making.
Among all the chemical attributes, alcohol had the strongest association with the quality rating of the wine (correlation coefficient r of 0.476, i.e. r-squared of about 0.227). Less dense wines (density < 1) with a higher alcohol percentage by volume were more likely to get higher quality ratings.
This study explored the red wine dataset containing 1599 observations of 13 different attributes. Among the 13 variables, 11 were chemical parameters that play an important role in wine taste and quality. The main objective of this study was to explore the relationship between quality and the chemical parameters. Using statistical methods and graphical analysis, different associations between the predictor and predicted variables were studied. Despite the many variables in the dataset, only a few showed a notable relationship with quality.
The variables included in the linear model (alcohol, density, volatile acidity, fixed acidity and citric acid) gave a combined r-squared value of 0.3249. R-squared is the percentage of the variation in the response variable that is explained by a linear model, so this low value means that the predictors are only weakly related to quality and the model explains only about 32% of the variation in wine quality.
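The definition of r-squared can be made concrete by computing it by hand as 1 minus the residual sum of squares divided by the total sum of squares. The first part of this sketch uses simulated data (`x` and `y` are hypothetical). The second part re-derives the m5 value from the model table, assuming its "Deviance" row is the residual sum of squares, as it is for models fitted with `lm`.

```r
set.seed(7)
# Hypothetical data for a one-predictor model
x <- rnorm(100)
y <- 1 + 0.5 * x + rnorm(100)
fit <- lm(y ~ x)

# R-squared by hand: share of total variation explained by the model
rss <- sum(residuals(fit)^2)    # unexplained (residual) sum of squares
tss <- sum((y - mean(y))^2)     # total sum of squares around the mean
r2_manual <- 1 - rss / tss

# Matches the value reported by summary()
r2_summary <- summary(fit)$r.squared

# Consistency check against the model table (assumption: Deviance = RSS).
# From m1: RSS = 805.870 and R-squared = 0.227, so TSS = RSS / (1 - R2).
tss_table  <- 805.870 / (1 - 0.227)
r2_m5_calc <- 1 - 703.539 / tss_table   # about 0.325, as reported for m5
```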
Our study revealed that wines with a higher alcohol percentage but less volatile acidity were considered high quality wines. In addition, wines on the more acidic side and with lower density were rated better in taste and quality.
Although the dataset is fairly large (1599 observations), it has the drawback of limited variability. The quality variable is an integer rating on a scale from 0 to 10, but its distribution is so uneven that more than 80% of the wines were rated 5 or 6. There were 10 wines rated 3, 53 wines rated 4, 199 wines rated 7 and 18 wines rated 8. Thus, there were not enough observations for the quality ratings 3, 4, 7 and 8. Because of this limitation, it was very difficult to assess the relationship between the quality variable and the chemical parameters. The data would have been more useful and more insightful if the ratings had been more evenly distributed.
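The share of wines rated 5 or 6 follows directly from the counts quoted above. The short calculation below uses only those counts and the total of 1599; the individual counts for ratings 5 and 6 are not stated in this report, so only their combined total is derived.

```r
# Counts quoted in the text for the less common ratings
n_total     <- 1599
counts_tail <- c("3" = 10, "4" = 53, "7" = 199, "8" = 18)

# Everything else was rated 5 or 6
n_mid     <- n_total - sum(counts_tail)          # 1319 wines
share_mid <- round(100 * n_mid / n_total, 1)     # 82.5 percent

# With the real data, the full distribution would come from:
# table(wine$quality)
```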
Irrespective of these limitations, this dataset was very interesting and challenging to work with. Working with so many variables provided a great opportunity to study different interactions. It would be even more interesting to explore the white wine dataset and compare the variables and linear models between the two datasets.